1 Introduction

A compositional data vector is a special type of multivariate observation in which the elements of the
vector are non-negative and sum to a constant, usually taken to be unity. Data of this type arise in many
fields including geology, archaeology, biology and economics. In mathematical terms, the relevant sample
space is the standard simplex, defined by

$$S^d = \left\{ (x_1, \ldots, x_D)^T : x_i \ge 0, \ \sum_{i=1}^{D} x_i = 1 \right\}, \tag{1}$$

where $d = D - 1$.

The following two approaches have been widely used for compositional data analysis. The first is to
neglect the compositional constraint and apply standard multivariate data analysis techniques to the raw
data, an approach we will call raw data analysis (RDA). For discussion and examples of RDA see, for
example, Baxter (1995, 2001), Baxter et al. (2005) and Baxter and Freestone (2006). RDA is also adopted by
Srivastava et al. (2007) in biological settings. The second approach, introduced by Aitchison (1982, 1983,
1986), is based upon transforming the data using log-ratios, an approach we will call log-ratio analysis
(LRA). The motivation behind LRA is that compositional data carry only relative information about the
components and hence working with logs of ratios is appropriate for analysing this information. A popular
log-ratio transformation, suggested by Aitchison (1983, 1986), is the centred log-ratio transformation
defined by

$$y = \{ y_i \}_{i=1,\ldots,D} = \left\{ \log \frac{x_i}{g(x)} \right\}_{i=1,\ldots,D}, \tag{2}$$

where $g(x) = \left( \prod_{j=1}^{D} x_j \right)^{1/D}$ is the geometric mean of the $D$ components of the composition. This
transformation maps the standard simplex (1) onto a $d$-dimensional subspace of $R^D$, given by

$$\left\{ (y_1, \ldots, y_D)^T : \sum_{i=1}^{D} y_i = 0 \right\}.$$

The so-called isometric log-ratio transformation (Egozcue et al., 2003) is then given by $z = Hy$, where $H$
is a $d \times D$ orthonormal matrix whose rows are orthogonal to $1_D$, the $D$-vector of ones. A standard choice
of $H$ is the Helmert sub-matrix obtained by removing the first row from the Helmert matrix (Lancaster,
1965).
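The two log-ratio transformations above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `helmert_submatrix`, `clr` and `ilr` are hypothetical, the Helmert rows follow the usual sign convention, and the composition `x` is made up for the example.

```python
import numpy as np

def helmert_submatrix(D):
    """d x D Helmert sub-matrix (d = D - 1): the Helmert matrix without its
    first row. Rows are orthonormal and each row is orthogonal to 1_D."""
    H = np.zeros((D - 1, D))
    for k in range(1, D):
        H[k - 1, :k] = 1.0          # k leading ones
        H[k - 1, k] = -k            # followed by -k, remaining entries zero
        H[k - 1] /= np.sqrt(k * (k + 1))
    return H

def clr(x):
    """Centred log-ratio transformation (2): log(x_i / g(x))."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)

def ilr(x):
    """Isometric log-ratio transformation z = H y, with y = clr(x)."""
    y = clr(x)
    return y @ helmert_submatrix(y.shape[-1]).T

x = np.array([0.2, 0.3, 0.5])   # illustrative composition, not from the paper
H = helmert_submatrix(3)
y = clr(x)                      # lies in the subspace where components sum to zero
z = ilr(x)                      # d = 2 coordinates
```

Because the rows of `H` form an orthonormal basis of the subspace orthogonal to the vector of ones, `z` has the same Euclidean norm as `y`, which is the sense in which the transformation is isometric.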



Over the years there has been a debate over whether it is preferable to use RDA or LRA (Barcelo et al.,
1996; Aitchison, 1999; Aitchison et al., 2000; Beardah et al., 2003; Sharp, 2006). In this paper we take
the view that the choice between RDA, LRA and other possibilities should depend at least in part on
the data, and should not be purely a matter of a priori considerations. For this purpose we consider a
framework in which RDA and LRA relate to two special cases of a one-parameter family of transformations
of the simplex (1). The idea is that we should explore this family of transformations when deciding which
approach to use for a given compositional dataset because, for different datasets, different transformations
may be preferable. For a more extensive discussion of this framework we refer to Tsagris et al. (2011).

The outline of this paper is as follows. Section 2 contains a short and informal comparison of RDA
and LRA, while in section 3 we see the contrasting results of applying RDA and LRA to some examples
involving real and artificial data. In section 4, we describe the family of power transformations, discuss
its relationship with RDA and LRA and show an example of analysis based on these transformations.
Finally, in section 5, we present our conclusions.



2 An informal comparison of RDA and LRA 
2.1 Simplicial distance 

Aitchison (1992) argues that a simplicial metric should satisfy certain properties. These properties include:



1. Scale invariance. The requirement here is that the measure used to define the distance between 
two compositional data vectors should be scale invariant, in the sense that it makes no difference 
whether, for example, the compositions are represented by proportions or percentages. 

2. Subcompositional dominance. To explain this we consider two compositional data vectors, and select 
subvectors from each consisting of the same components. Then subcompositional dominance means 
that the distance between the subvectors is always less than or equal to the distance between the 
original compositional vectors. 

3. Perturbation invariance. The perturbation of one compositional data vector by another is defined
by (6) below. The requirement here is that the distance between compositional vectors $x$ and $w$
should be the same as the distance between $x \oplus_0 p$ and $w \oplus_0 p$, where the operator $\oplus_0$ is defined in (6)
and $p$ is any vector with positive components.

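RDA amounts to using the Euclidean distance as a metric in the simplex. In RDA the distance
between two vectors $x$ and $w \in S^d$ is

$$\Delta_{\mathrm{RDA}}(x, w) = \left[ \sum_{i=1}^{D} (x_i - w_i)^2 \right]^{1/2}. \tag{3}$$

In LRA the relevant metric is the Euclidean distance applied to the log-ratio transformed data
(Aitchison, 1983, 1986). In mathematical terms the expression is

$$\Delta_{\mathrm{LRA}}(x, w) = \left[ \sum_{i=1}^{D} \left( \log \frac{x_i}{g(x)} - \log \frac{w_i}{g(w)} \right)^2 \right]^{1/2}. \tag{4}$$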

Aitchison (1986) has shown that $\Delta_{\mathrm{LRA}}$ satisfies the three properties above. It is straightforward to see
that, in contrast to $\Delta_{\mathrm{LRA}}$, $\Delta_{\mathrm{RDA}}$ does not satisfy any of these properties (although $\Delta_{\mathrm{RDA}}$ is equivariant
with respect to a scale change, in an obvious sense). On this basis, Aitchison argued that the metric
$\Delta_{\mathrm{LRA}}$ is preferable to $\Delta_{\mathrm{RDA}}$. However, although at first glance properties 1-3 seem to be reasonable
requirements, there will sometimes be a price to be paid for using $\Delta_{\mathrm{LRA}}$. In particular, this choice, and
similarly the choice of $\Delta_{\mathrm{RDA}}$, ties us to a specific geometry on the simplex which may or may not be
appropriate for a given dataset. Our thinking here is that, ideally, one should work with a geometry on
the simplex in which the structure of the data is "as nice as possible". In the data examples considered
in section 3, niceness means linear structure and a measure of central tendency which lies in the main
body of the sample, though of course different definitions of niceness could be adopted. The main point,
however, is that one does not know in advance what the most appropriate geometry on the simplex is,
because it will depend on the data.
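As a rough numerical illustration of the properties discussed in this subsection, the sketch below evaluates both metrics on two made-up compositions. The helper names (`closure`, `clr`, `dist_rda`, `dist_lra`, `perturb`) and the vectors `x`, `w`, `p` are assumptions introduced for this example only; a single pair of points can illustrate a property but cannot prove or disprove it in general.

```python
import numpy as np

def closure(x):
    """Rescale a positive vector so that its components sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def dist_rda(x, w):
    """Euclidean distance on the raw vectors."""
    return np.linalg.norm(np.asarray(x, float) - np.asarray(w, float))

def dist_lra(x, w):
    """Euclidean distance between the centred log-ratio vectors."""
    return np.linalg.norm(clr(x) - clr(w))

def perturb(x, p):
    """Component-wise product followed by closure."""
    return closure(np.asarray(x) * np.asarray(p))

x = closure([0.1, 0.3, 0.6])     # illustrative compositions
w = closure([0.4, 0.4, 0.2])
p = np.array([2.0, 0.5, 1.0])    # any vector with positive components

# Scale invariance: proportions versus percentages.
lra_scale_gap = abs(dist_lra(100 * x, 100 * w) - dist_lra(x, w))
rda_scale_ratio = dist_rda(100 * x, 100 * w) / dist_rda(x, w)

# Perturbation invariance.
lra_pert_gap = abs(dist_lra(perturb(x, p), perturb(w, p)) - dist_lra(x, w))
rda_pert_gap = abs(dist_rda(perturb(x, p), perturb(w, p)) - dist_rda(x, w))

# Subcompositional dominance for the log-ratio metric (first two components).
lra_dominance = dist_lra(closure(x[:2]), closure(w[:2])) <= dist_lra(x, w)
```

Here the log-ratio distance is unchanged by rescaling and by perturbation (both gaps are zero up to rounding), whereas the Euclidean distance is multiplied by 100 under the change of scale and changes under perturbation.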

2.2 Measures of central tendency and simplicial addition



The two approaches lead to different definitions of measures of central tendency. The means specified
in (5) and (7) below are Fréchet means with respect to $\Delta_{\mathrm{RDA}}$ and $\Delta_{\mathrm{LRA}}$, respectively. The Fréchet
mean on a metric space $(M, \mathrm{dist})$, with the distance between $p, q \in M$ given by $\mathrm{dist}(p, q)$, is defined
by $\arg\min_{\mu} E\left[ \mathrm{dist}(X, \mu)^2 \right]$ and $\arg\min_{\mu} \sum_{j=1}^{n} \mathrm{dist}(x_j, \mu)^2$ in the population and finite sample cases,
respectively. In the geometry determined by RDA, we define the simplicial analogue of addition, denoted
by $\oplus_1$, as vector addition followed by rescaling to fix the sum of components to be 1:

$$x \oplus_1 w = C\left( \{ x_i + w_i \}_{i=1,\ldots,D} \right),$$

where $C$ is the closure operation $C(x) = x / (x_1 + \ldots + x_D)$. The mean in this case is defined as the simple
arithmetic mean of the components:

$$\mu^{(1)} = \left\{ \frac{1}{n} \sum_{j=1}^{n} x_{ij} \right\}_{i=1,\ldots,D}, \tag{5}$$

where $x_{ij}$ denotes the $i$-th component of the $j$-th observation.

In the geometry defined by LRA, the analogue of addition of two compositional vectors $x$ and $w$,
known as the perturbation of $x$ by $w$ and denoted by $x \oplus_0 w$, is given by the component-wise product
followed by the closure operation:

$$x \oplus_0 w = C\left( \{ x_i w_i \}_{i=1,\ldots,D} \right). \tag{6}$$

In the LRA case, the Fréchet mean is given by the vector consisting of the sample geometric mean of each
component followed by the closure operation (Aitchison, 1989):

$$\mu^{(0)} = C\left( \left\{ \prod_{j=1}^{n} x_{ij}^{1/n} \right\}_{i=1,\ldots,D} \right). \tag{7}$$

2.3 Further discussion

LRA works very well when the data follow the logistic normal distribution; see for instance the Hongite,
Kongite and Boxite data analysed by Aitchison (1986). However, it is not clear that LRA remains the
best approach for data with other distributions. Baxter and Freestone (2006)
performed principal component analysis (PCA) on both simulated and real data. They concluded that,
in the examples considered, RDA produced more archaeologically interpretable results than LRA.
In a previous paper, Beardah et al. (2003) mention that the use of PCA on standardized data may be
regarded as incorrect by proponents of LRA, but that it did recover interpretable structure, whereas
the log-ratio transformation failed to detect it. Standardizing the data after the centred log-ratio
transformation worked well, but the results were for the most part similar to those of the standardized
data analysis.

RDA has another potential advantage: in contrast to the LRA mean in (7), the RDA mean (5) is
well-defined even when the data have some components equal to zero.
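The two means and their behaviour in the presence of zeros can be sketched as follows. The function names and the small data matrices are hypothetical and serve only to illustrate the point; observations are stored as rows.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def arithmetic_mean(X):
    """RDA mean: component-wise arithmetic mean of the rows of X."""
    return np.asarray(X, dtype=float).mean(axis=0)

def closed_geometric_mean(X):
    """LRA mean: component-wise geometric mean of the rows, then closure."""
    X = np.asarray(X, dtype=float)
    with np.errstate(divide="ignore"):      # log(0) = -inf is handled below
        log_gm = np.log(X).mean(axis=0)
    return closure(np.exp(log_gm))

# Illustrative data (not from the paper): three compositions with D = 3.
X = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
X_zero = X.copy()
X_zero[0] = [0.0, 0.5, 0.5]                 # one component equal to zero

am, gm = arithmetic_mean(X), closed_geometric_mean(X)
am_zero, gm_zero = arithmetic_mean(X_zero), closed_geometric_mean(X_zero)
```

With a single zero in the first component, the geometric mean of that component collapses to zero whatever the other observations are, so `gm_zero` sits on the boundary of the simplex, and the closure becomes 0/0 if every component contains a zero somewhere in the sample. The arithmetic mean `am_zero` is unaffected.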

A general transformation applied to compositional data is the Box-Cox transformation applied to
ratios of components:

$$\left\{ \frac{(x_i / x_D)^{\lambda} - 1}{\lambda} \right\}_{i=1,\ldots,d}. \tag{8}$$

Here $x_D$ stands for the last component, but with relabelling any component can play the role of the
common divisor. As $\lambda \to 0$, the above expression tends to the "additive log-ratio transformation", which
was defined by Aitchison as an alternative transformation to the centred log-ratio transformation:

$$\left\{ \log \frac{x_i}{x_D} \right\}_{i=1,\ldots,d}. \tag{9}$$

In Barcelo et al. (1996) the Box-Cox transformation defined in (8) is examined, and the conclusion was
that some samples cannot be adequately modelled using the additive log-ratio transformation defined in
(9), in the sense that the value of $\lambda$ should not be restricted to zero.
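The limiting behaviour of the Box-Cox transformation of ratios is easy to check numerically. The sketch below assumes the last component is the common divisor; the function name `boxcox_ratio` and the composition `x` are hypothetical.

```python
import numpy as np

def boxcox_ratio(x, lam):
    """Box-Cox transformation of the ratios x_i / x_D, i = 1, ..., d.
    lam = 0 is taken as the limiting additive log-ratio transformation."""
    x = np.asarray(x, dtype=float)
    r = x[:-1] / x[-1]
    if lam == 0:
        return np.log(r)
    return (r**lam - 1.0) / lam

x = np.array([0.2, 0.3, 0.5])   # illustrative composition
alr = boxcox_ratio(x, 0)
# Largest absolute difference from the additive log-ratio as lam shrinks.
gaps = [np.max(np.abs(boxcox_ratio(x, lam) - alr)) for lam in (1.0, 0.1, 0.001)]
```

The entries of `gaps` shrink towards zero as `lam` decreases, in line with the statement that the additive log-ratio transformation is the $\lambda \to 0$ limit.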

Recently, Sharp (2006) suggested the graph median as a measure of central tendency alternative to the
geometric or the arithmetic mean. His purpose was to propose a suitable alternative to the closed
geometric mean when the latter fails in the sense that it does not lie within the body of the data.



3 Some examples 

We now present some examples with real and artificial data. In the first example the use of LRA is to be 
preferred whereas RDA seems to work better in the second example. 



3.1 Example 1: Hongite data 

We created a subcomposition taking the first 3 components of the Hongite data (Aitchison, 1986) and plotted
it in Figure 1. The closed geometric mean in (7) clearly lies within the body of the data, whereas
the arithmetic mean in (5) has failed in this example as it lies outside the body. The assumption of
logistic normality is not rejected for this composition by the battery of tests suggested by Aitchison
(1986).



[Ternary diagram; legend entries: closed geometric mean, arithmetic mean.]

Figure 1: Ternary diagram of a subcomposition of the Hongite data. The closed geometric mean (7) and the arithmetic mean
(5) are presented.

3.2 Example 2: Artificial data



Here, we created a dataset which is concentrated on a straight line when plotted in a ternary diagram;
see Figure 2. In this case, the data do not follow the logistic normal distribution according to the battery
of tests suggested by Aitchison (1986). The arithmetic mean (5) is a good measure of central tendency,
whereas the closed geometric mean (7) has failed because it lies off the line around which the data are
concentrated.



[Ternary diagram; legend entries: closed geometric mean, arithmetic mean.]

Figure 2: Ternary diagram of artificial data showing an example where the geometric mean appears to be an unsuitable
measure of central tendency.
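The effect described in Example 2 can be reproduced with a small synthetic dataset. The sketch below does not use the artificial data of Table 1; instead it assumes 25 compositions placed exactly on a segment between two hypothetical endpoints `p` and `q`, and measures how far each mean lies from that line.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def dist_to_line(m, p, q):
    """Euclidean distance from the point m to the straight line through p and q."""
    direction = (q - p) / np.linalg.norm(q - p)
    offset = m - p
    return np.linalg.norm(offset - (offset @ direction) * direction)

# Hypothetical endpoints of the segment; both are valid compositions.
p = np.array([0.80, 0.19, 0.01])
q = np.array([0.01, 0.19, 0.80])
t = np.linspace(0.0, 1.0, 25)[:, None]
X = (1 - t) * p + t * q                          # 25 compositions on the segment

am = X.mean(axis=0)                              # arithmetic mean
gm = closure(np.exp(np.log(X).mean(axis=0)))     # closed geometric mean

am_off = dist_to_line(am, p, q)
gm_off = dist_to_line(gm, p, q)
```

The arithmetic mean of points on a line stays on that line, so `am_off` is zero up to rounding. The closed geometric mean is pulled towards the second component, so `gm_off` is clearly positive.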

Our conclusions from the examples considered above are that in the first case, the approach based
on LRA and the closed geometric mean seems more appropriate, while in the second case RDA and the
arithmetic mean seem better. In general, the question of interest is whether to use LRA, RDA
or perhaps another possibility. We believe that this decision should depend on the nature of the data at
hand.

4 A family of power transformations on the simplex 

We now consider a Box-Cox type transformation in terms of which RDA and LRA can be considered
special cases. Consider the power transformation introduced by Aitchison (1986),

$$u = \{ u_i \}_{i=1,\ldots,D} = \left\{ \frac{x_i^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} \right\}_{i=1,\ldots,D}, \tag{10}$$

and in terms of this define

$$z = \frac{1}{\alpha} \left( D u - 1_D \right) H^T, \tag{11}$$

where we take $H$ to be the $d \times D$ Helmert sub-matrix mentioned in the Introduction. Note that the
$\alpha$-transformed vector $u$ in (10) remains in the simplex $S^d$, whereas $z$ is mapped onto a subset of $R^d$. Note
that (11) is simply a linear transformation of (10). Moreover, as $\alpha \to 0$, (11) converges to the isometric
log-ratio transformation. For a given $\alpha$ we can define a simplicial distance:

$$\Delta_{\alpha}(x, w) = \frac{D}{|\alpha|} \left[ \sum_{i=1}^{D} \left( \frac{x_i^{\alpha}}{\sum_{j=1}^{D} x_j^{\alpha}} - \frac{w_i^{\alpha}}{\sum_{j=1}^{D} w_j^{\alpha}} \right)^2 \right]^{1/2}. \tag{12}$$

Associated with this one-parameter family of distances is the family of Fréchet means

$$\mu^{(\alpha)} = C\left( \left\{ \left( \frac{1}{n} \sum_{j=1}^{n} \frac{x_{ij}^{\alpha}}{\sum_{k=1}^{D} x_{kj}^{\alpha}} \right)^{1/\alpha} \right\}_{i=1,\ldots,D} \right), \tag{13}$$

where $x_{ij}$ is the $i$-th component of the $j$-th of $n$ observations.
This agrees with (5) when $\alpha = 1$ and, in the limit, with (7) when $\alpha = 0$. Specifically, as $\alpha \to 0$, the $\alpha$-distance
(12) and Fréchet mean (13) converge to Aitchison's distance (4) and the closed geometric mean
(7), respectively. When $\alpha = 1$, the Fréchet mean is equal to the arithmetic mean of the components (5)
and the $\alpha$-distance is equal to the Euclidean distance in the simplex (3) multiplied by $D$.
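A minimal sketch of the transformation (11) and the associated distance is given below, under the assumption that the $\alpha$-distance is computed as the Euclidean distance between the transformed vectors and that $\alpha = 0$ is handled as the isometric log-ratio limit. The function names and the two compositions are hypothetical.

```python
import numpy as np

def helmert_submatrix(D):
    """d x D Helmert sub-matrix (Helmert matrix without its first row)."""
    H = np.zeros((D - 1, D))
    for k in range(1, D):
        H[k - 1, :k] = 1.0
        H[k - 1, k] = -k
        H[k - 1] /= np.sqrt(k * (k + 1))
    return H

def alpha_transform(x, alpha):
    """z of (11); alpha = 0 is taken as the isometric log-ratio limit."""
    x = np.asarray(x, dtype=float)
    D = x.shape[-1]
    H = helmert_submatrix(D)
    if alpha == 0:
        logx = np.log(x)
        return (logx - logx.mean(axis=-1, keepdims=True)) @ H.T
    u = x**alpha / np.sum(x**alpha, axis=-1, keepdims=True)   # power transformation (10)
    return (D * u - 1.0) @ H.T / alpha

def alpha_distance(x, w, alpha):
    """Euclidean distance between the alpha-transformed vectors."""
    return np.linalg.norm(alpha_transform(x, alpha) - alpha_transform(w, alpha))

x = np.array([0.2, 0.3, 0.5])   # illustrative compositions
w = np.array([0.6, 0.3, 0.1])

d_one = alpha_distance(x, w, 1.0)      # D times the Euclidean distance
d_small = alpha_distance(x, w, 1e-6)   # close to the log-ratio distance
d_zero = alpha_distance(x, w, 0)       # log-ratio (Aitchison) distance
```

Since the rows of `H` are orthonormal and orthogonal to the vector of ones, the distance between the transformed vectors equals $D/|\alpha|$ times the Euclidean distance between the power-transformed compositions, which is why `d_one` equals `3 * np.linalg.norm(x - w)` here.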

The choice of the parameter $\alpha$ should depend on the type of analysis we wish to perform. In
discriminant analysis, for instance, the percentage of correct classification (estimated via cross-validation)
could be used as a criterion to maximize with respect to the value of $\alpha$. Similarly, in linear regression the
pseudo-$R^2$ (Aitchison, 1986) could serve as a criterion.

Alternatively, if we are prepared to assume that the data follow a parametric model after an (unknown)
$\alpha$-transformation, we could use the profile log-likelihood to choose $\alpha$. As in the Box-Cox transformation,
the value of the parameter is selected by maximizing the profile log-likelihood of the transformed data.
One possibility, which we adopted with the Arctic lake data (Aitchison, 1986) in Figure 3, was to treat the
data as being from a (singular) multivariate normal after a suitable $\alpha$-transformation. This approach has
the obvious drawback that it ignores the fact that any multivariate normal will assign positive probability
outside the simplex, which may or may not be of practical importance, depending on how concentrated the
normal distribution is. The profile log-likelihood of $\alpha$ applied to (11) is the same as the log-likelihood
applied to (10) plus a constant term in $\log D$ that depends on $n$, where $n$ and $D$ denote the sample size and
the number of components of the composition, respectively. Figure 3 shows the ternary diagram of the Arctic
lake data (Aitchison, 1986). The profile log-likelihood of $\alpha$ for these data was maximized when $\alpha = 0.362$. Thus the
three Fréchet means for $\alpha = 0$, $0.362$ and $1$ were calculated and plotted as well.
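The family of Fréchet means can be sketched as below: the power-transformed observations are averaged, mapped back with the power $1/\alpha$, and closed. The Arctic lake data are not reproduced here, so the example assumes synthetic Dirichlet-distributed compositions; the function name `frechet_mean` and the data are hypothetical. The back-transformation is done on the log scale, a numerical safeguard added in this sketch to avoid underflow for small $\alpha$.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def frechet_mean(X, alpha):
    """alpha-Frechet mean of the rows of X.
    alpha = 0 is taken as the closed geometric mean (the limiting case)."""
    X = np.asarray(X, dtype=float)
    if alpha == 0:
        return closure(np.exp(np.log(X).mean(axis=0)))
    U = X**alpha / np.sum(X**alpha, axis=1, keepdims=True)   # power-transformed rows
    log_m = np.log(U.mean(axis=0)) / alpha                   # log of mean**(1/alpha)
    return closure(np.exp(log_m - log_m.max()))              # common shift cancels in the closure

rng = np.random.default_rng(0)
X = rng.dirichlet([2.0, 3.0, 5.0], size=50)   # synthetic data, not the Arctic lake data

means = {a: frechet_mean(X, a) for a in (0.0, 0.362, 1.0)}
```

For $\alpha = 1$ the result coincides with the component-wise arithmetic mean, and for very small positive $\alpha$ it approaches the closed geometric mean. The value 0.362 is used only because it is the maximizer reported in the text for the Arctic lake data; it has no special meaning for the synthetic sample.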

Figure 3: Ternary diagram of the Arctic lake data. The three Fréchet means are plotted.

5 Conclusions 

Two standard approaches to compositional data analysis, referred to in the paper as RDA and LRA, 
were discussed and compared. On the basis of simple numerical examples and further considerations we 
concluded that which approach, if either, is to be preferred should depend on the data under study. We 
then considered a framework involving a one-parameter family of Box-Cox type power transformations 
which includes the RDA and LRA approaches as special cases. Depending on the purposes of the analysis, 
we may then choose an optimal value of $\alpha$ by optimizing a suitable criterion, and/or explore the practical
implications of different choices of $\alpha$. This framework offers greater flexibility than simply adopting one
method or another. 

Appendix



Table 1: Artificial data used in example 2 